{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Cross-validation</h3>\n",
    "<p>In this notebook, we perform a cross-validation to try and choose the polynomial order for some synthetic data.</p>\n",
    "<p>The loss (in our case, this is equal to the squared error) on the training data will always decrease as we make the model more complex, and therefore cannot be used to select how complex the model ought to be. In some cases we may be lucky enough to have an independent test set for model validation but often we will not. In a cross-validation procedure (CV) we split the data into $K$ folds, and train $K$ different models, each with a different data fold removed. This removed fold is used for testing and the performance is averaged over the $K$ different folds.</p>\n",
    "<p>This process is illustrated in the following figure:</p>\n",
    "<img src=\"files/cvdiagram.png\">\n",
    "<p>We start by generating two datasets - one on which we will perform the CV, and a second independent test set.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "N = 100\n",
    "x = 10*np.random.rand(N,1) - 5\n",
    "t = 5*x**3 - x**2 + x + 150*np.random.randn(N,1)\n",
    "N_test = 200\n",
    "x_test = np.linspace(-5,5,N_test)[:,None]\n",
    "t_test = 5*x_test**3 - x_test**2 + x_test + 150*np.random.randn(N_test,1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>We now loop over model order (from 0th up to 10th order polynomials) and for each order, perform a $K=10$ fold CV. For each fold-order combintaion we compute the mean squared loss on the training data, the mean squared loss on the held out fold and the mean squared loss on the independent test set.</p>\n",
    "<p>Note that in many cases, it would be sensible to randomise the order of the data before placing it into the folds.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "max_order = 10\n",
    "X = []\n",
    "X_test = []\n",
    "K = 10\n",
    "sizes = np.tile(np.int(N/float(10)),(1,K))\n",
    "sizes[0,-1] = sizes[0,-1] + N - sizes.sum()\n",
    "c_sizes = np.hstack((0,np.cumsum(sizes)))\n",
    "X = np.ones_like(x)\n",
    "X_test = np.ones_like(x_test)\n",
    "cv_loss = np.zeros((K,max_order+1))\n",
    "ind_loss = np.zeros((K,max_order+1))\n",
    "train_loss = np.zeros((K,max_order+1))\n",
    "\n",
    "for k in range(max_order+1):\n",
    "    for fold in range(K):\n",
    "        X_fold = X[c_sizes[fold]:c_sizes[fold+1],:]\n",
    "        X_train = np.delete(X,np.arange(c_sizes[fold],c_sizes[fold+1],1),0)\n",
    "        t_fold = t[c_sizes[fold]:c_sizes[fold+1]]\n",
    "        t_train = np.delete(t,np.arange(c_sizes[fold],c_sizes[fold+1],1),0)\n",
    "        w = np.linalg.solve(np.dot(X_train.T,X_train),np.dot(X_train.T,t_train))\n",
    "        fold_pred = np.dot(X_fold,w)\n",
    "        cv_loss[fold,k] = ((fold_pred - t_fold)**2).mean()\n",
    "        ind_pred = np.dot(X_test,w)\n",
    "        ind_loss[fold,k] = ((ind_pred - t_test)**2).mean()\n",
    "        train_pred = np.dot(X_train,w)\n",
    "        train_loss[fold,k] = ((train_pred - t_train)**2).mean()\n",
    "    X = np.hstack((X,x**(k+1)))\n",
    "    X_test = np.hstack((X_test,x_test**(k+1)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>Finally, we plot the mean performance in each case (where the mean is taken over folds).</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.text.Text at 0x105c39890>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pylab as plt\n",
    "%matplotlib inline\n",
    "order = np.arange(max_order+1)\n",
    "plt.plot(order,train_loss.mean(axis=0),'b-',label=\"Training loss\")\n",
    "plt.plot(order,cv_loss.mean(axis=0),'r-',label=\"CV loss\")\n",
    "plt.plot(order,ind_loss.mean(axis=0),'k',label=\"Independent test loss\")\n",
    "plt.legend()\n",
    "plt.xlabel('Model order')\n",
    "plt.ylabel('Mean squared loss')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>As expected, training loss (blue) always decreaes. The CV loss doesn't, and comes to a minimum at around 3rd order (this will not always be the case - it depends on the dataset generated at the start of the notebook). The independent loss typically has a more well-defined minimum.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
